The objective of this study is to investigate the critical factors that contribute to an individual’s appeal, popularity, and recognition within an online dating platform. The data utilised for this research is sourced from Lovoo, a prominent European dating application, and is accessible via Kaggle.
The underlying motivation for this study stems from the desire to comprehend behavioural patterns that transcend the confines of physical attractiveness. The aim is to unveil hidden determinants that may shape interpersonal interactions within a digital dating platform. The behaviour exhibited on these platforms carries significance, even in economic contexts. Deciphering this behaviour can potentially contribute to the development of economic models, which in turn can offer a more profound analytical framework for explaining overall mate-selection behaviour.
The initial phase of the analysis involves the transformation of raw data into a more interpretable format. This includes the creation of additional variables tailored to augment the predictive capacity of the statistical models employed in subsequent stages. This phase facilitates the exploratory aspect of the research, enabling an in-depth examination of data in search of potential predictor variables. The objective extends beyond understanding the phenomena; the aim is to anticipate which factors instigate an increased number of profile views and, subsequently, the ‘likes’ received.
The modelling process is a two-step approach. The first stage focuses on identifying variables that may elucidate why individuals view a certain profile. Potential variables include online presence, age, geographical location, and the timing of an individual’s online activity. The second stage aims to identify factors that influence the likelihood of a profile receiving ‘likes’. These may include the number of pictures on a profile, the characteristics of a profile’s biography, languages spoken, profile verification status, and mobile usage.
The project employed decision tree models to analyze the intricate patterns influencing user behaviour. Decision trees offer a clear and comprehensible framework to identify the complex characteristics that impact outcomes. Using these models, the analysis yielded insightful findings on unique aspects of user engagement on the dating platform. Two decision tree models were developed, each focusing on predicting different facets of user behaviour, namely profile views and ‘likes’. This approach facilitated a deeper and more nuanced comprehension of the drivers behind these two critical indicators of user engagement.
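To make the idea of a decision tree split concrete, the following is a minimal sketch of how a single split can be chosen. It assumes Gini impurity as the splitting criterion, which the study does not specify, and the toy data and the names `gini` and `best_split` are hypothetical illustrations rather than the models used here.

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels (0.0 means perfectly pure)."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def best_split(values, labels):
    """Return (threshold, weighted_gini) for the best split 'value <= threshold'."""
    best = (None, gini(labels))  # start from the unsplit impurity
    for t in sorted(set(values))[:-1]:  # the largest value cannot split the data
        left = [l for v, l in zip(values, labels) if v <= t]
        right = [l for v, l in zip(values, labels) if v > t]
        score = (len(left) * gini(left) + len(right) * gini(right)) / len(labels)
        if score < best[1]:
            best = (t, score)
    return best

# Hypothetical toy data: number of pictures vs. a 'likes' category.
pictures = [1, 2, 3, 8, 9, 12]
likes = ["Low", "Low", "Low", "High", "High", "High"]
threshold, impurity = best_split(pictures, likes)
```

A full tree repeats this search recursively on each resulting subset, which is what makes the fitted rules easy to read off.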
The results indicated a strong correlation between the number of profile views and variables such as online presence, age, mobility, and the timing of online activity. However, profile views were found to be the only predictor that significantly influences the ‘likes’ received by a profile. Thus, these decision tree models shed light on the variables that play pivotal roles on digital dating platforms and, by extension, in mate-searching behaviour.
The dataset under consideration comprises 3 973 observations and approximately 30 variables, each capturing specific attributes of individual profiles and related demographic information. An excerpt of the dataset is provided in Table 1, supplemented by Table 2, which describes a selection of the most important variables. It is worth noting that the dataset only covers individuals identifying as female. As such, the core objective of this analysis is to discern the determinants influencing the behaviour of individuals displaying interest in females.
| age | counts_details | counts_pictures | counts_profileVisits | counts_kisses | counts_g | flirtInterests_chat | verified | lang_count | lang_de | whazzup | freetext |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 25 | 1.00 | 4 | 8279 | 239 | 3 | TRUE | 0 | 1 | TRUE | Nur tote fische schwimmen mit dem strom | Nur tote Fisch schwimmen mit dem Strom |
| 22 | 0.85 | 5 | 663 | 13 | 0 | TRUE | 0 | 3 | TRUE | Primaveraaa<3 | NA |
| 21 | 0.00 | 4 | 1369 | 88 | 2 | FALSE | 0 | 0 | FALSE | NA | NA |
| 20 | 0.12 | 3 | 22187 | 1015 | 3 | TRUE | 0 | 2 | FALSE | Je pense donc je suis. Instagram quedev | NA |
| 21 | 0.15 | 12 | 35262 | 1413 | 12 | TRUE | 0 | 1 | TRUE | Instagram: JESSSIESCH | NA |
| Variable | Description |
|---|---|
| age | Age of the individual. |
| counts_details | How complete the profile is. Proportion of detail in the account. Measured from 0.0-1.0. |
| counts_pictures | Number of pictures the profile contains. |
| counts_profileVisits | How many times the profile has been viewed. |
| counts_kisses | Number of ‘kisses’ or ‘likes’ received by profile. |
| counts_g | Number of group interactions, which could represent the number of times a user has been added to a group or mentioned in a group chat. |
| flirtInterests_* | What the individual is interested in. ’*’ represents: ‘chat’, ‘date’, ‘friends’. |
| verified | Whether the profile has been verified or not. |
| lang_count | Number of languages spoken by an individual. |
| lang_* | Language spoken by an individual. ’*’ represents: ‘en’ (English), ‘de’ (German), ‘fr’ (French), ‘it’ (Italian), ‘es’ (Spanish). |
| whazzup/freetext | A set of phrases that represent the profile’s ‘bio’. |
| isMobile | Whether an individual can arrange transport for themselves. |
The original dataset is already quite usable, but better models can be produced by adding some new variables. The first step is to take a closer look at the language people use in their profiles. I focus on two main things here: the words used in the profile descriptions, and the use of emojis. Both of these could give insights into the person’s confidence and desirability.
I created two new dummy variables, has_emoji and contains_popular_word. has_emoji takes the value ‘1’ if whazzup or freetext contains an emoji, and contains_popular_word takes the value ‘1’ if whazzup or freetext contains a popular word. Figure 1 below shows the most popular words by means of a word cloud. (The word cloud is a dynamic image that shows a word’s popularity when hovering over it.)
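The two dummy variables could be derived along the following lines. This is only a sketch: the emoji pattern approximates emojis with a few common Unicode blocks (text emoticons such as "<3" are not counted), `POPULAR_WORDS` is a hypothetical stand-in for the list read off the word cloud, and missing bios are assumed to be represented as `None`.

```python
import re

# Assumption: emojis are approximated by a few common Unicode blocks;
# text emoticons such as "<3" are NOT counted.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

# Hypothetical word list; the real one would come from the word cloud.
POPULAR_WORDS = {"instagram", "snapchat"}

def bio_text(profile):
    """Join whazzup and freetext, treating missing values (None) as empty."""
    return " ".join(profile.get(k) or "" for k in ("whazzup", "freetext"))

def has_emoji(profile):
    """1 if either bio field contains a character matched by EMOJI_RE."""
    return int(bool(EMOJI_RE.search(bio_text(profile))))

def contains_popular_word(profile, popular=POPULAR_WORDS):
    """1 if any lower-cased word in the bio is in the popular-word set."""
    tokens = set(re.findall(r"\w+", bio_text(profile).lower()))
    return int(bool(tokens & popular))

profiles = [
    {"whazzup": "Instagram: JESSSIESCH", "freetext": None},
    {"whazzup": "Primaveraaa<3", "freetext": None},
    {"whazzup": "Sunny days \u2600", "freetext": None},  # made-up example
]
flags = [(has_emoji(p), contains_popular_word(p)) for p in profiles]
```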
Interestingly, the popular words (words that are used frequently and appear on profiles with many views) are largely social media tags. That is, individuals who list their social media details, such as their Instagram handle, Facebook name, or Snapchat handle, on their account tend to get more profile views. As such, I create another variable, called has_social, that captures whether a profile contains social media particulars. Because contains_popular_word and has_social capture largely the same information, including both would risk endogeneity and multicollinearity, so only one is used in modelling: whichever delivers the more accurate result.
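A possible way to build has_social, and to see how much it overlaps with contains_popular_word, is sketched below. The pattern in `SOCIAL_RE` and the `popular_flags` values are hypothetical; they only illustrate why the two dummies are likely to be close to redundant.

```python
import re
from collections import Counter

# Hypothetical pattern: platform names and a few common abbreviations.
SOCIAL_RE = re.compile(
    r"\b(instagram|insta|snapchat|snap|facebook|fb)\b", re.IGNORECASE
)

def has_social(bio):
    """Return 1 if the bio (possibly None) mentions a social media platform."""
    return int(bool(SOCIAL_RE.search(bio or "")))

bios = [
    "Instagram: JESSSIESCH",
    "Je pense donc je suis. Instagram quedev",
    "Nur tote fische schwimmen mit dem strom",
    None,
]
social_flags = [has_social(b) for b in bios]

# Hypothetical contains_popular_word flags for the same four profiles.
popular_flags = [1, 1, 1, 0]

# 2x2 contingency counts: keys are (has_social, contains_popular_word).
overlap = Counter(zip(social_flags, popular_flags))
```

If most profiles fall in the (1, 1) and (0, 0) cells, the two variables carry nearly the same signal, which is the motivation for keeping only one of them.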
Another operation pertains to the time at which an individual is online. I add a new dummy variable called night_owl to the dataframe, based on whether a person was online at night or not. The motivation is that dating apps tend to be more popular in the evenings than during the daytime.
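As an illustration, night_owl could be computed from a last-online timestamp as follows. The timestamp format, the example values, and the night window (20:00 up to 04:00) are all assumptions made for this sketch; the study does not state the exact definition of "night".

```python
from datetime import datetime

# Assumed night window: from 20:00 up to (but not including) 04:00.
NIGHT_START, NIGHT_END = 20, 4

def night_owl(last_online_iso):
    """Return 1 if the ISO-formatted timestamp falls inside the night window."""
    hour = datetime.fromisoformat(last_online_iso).hour
    return int(hour >= NIGHT_START or hour < NIGHT_END)

# Hypothetical timestamps, not taken from the dataset.
timestamps = ["2015-04-25T23:10:00", "2015-04-25T14:30:00", "2015-04-26T02:05:00"]
night_flags = [night_owl(t) for t in timestamps]
```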
Then, I transformed continuous numerical data from ‘Profile Views’ and ‘Likes’ into categorical variables, specifically ‘Low’, ‘Low Mid’, ‘High Mid’, and ‘High’. This process, known as discretization, was achieved through quartile-based categorization, an effective method for handling large-range continuous data in predictive models.
Quartiles divide data into quarters, aiding in capturing non-linear effects and minimizing outlier impact. For my data, ‘Low’ signifies values below the first quartile (25th percentile), ‘Low Mid’ denotes values between the first quartile and the median, ‘High Mid’ represents values between the median and the third quartile (75th percentile), and ‘High’ encompasses values above the third quartile. Despite the potential for information loss due to a reduction in distinct values, this method simplifies model interpretation and robustly handles outlier influence, key factors for ensuring reliable results.
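The quartile-based categorisation described above can be sketched as follows, using the five counts_profileVisits values from the excerpt in Table 1. Two details are assumptions of this sketch: the quartiles are computed with the "inclusive" method of the standard library, and a value exactly equal to a cut point is placed in the higher category.

```python
from bisect import bisect_right
from statistics import quantiles

LABELS = ["Low", "Low Mid", "High Mid", "High"]

def quartile_bins(values):
    """Map each value to one of four labels using the sample quartiles."""
    # cuts = [Q1, median, Q3]
    cuts = quantiles(values, n=4, method="inclusive")
    # bisect_right counts how many cut points are <= the value,
    # so a value equal to a cut point lands in the higher category.
    return [LABELS[bisect_right(cuts, v)] for v in values]

visits = [8279, 663, 1369, 22187, 35262]  # excerpt from Table 1
visit_bins = quartile_bins(visits)
```

With only five rows the categories are necessarily uneven; on the full dataset each category would hold roughly a quarter of the profiles.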
Finally, before starting the exploratory data analysis, it is useful to standardise the following three variables: counts_profileVisits, counts_kisses, and counts_pictures. Putting them on a common scale may help to visualise more clearly how the variables relate to each other.
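Standardisation here is taken to mean converting each variable to z-scores, $z = (x - \bar{x}) / s$. The sketch below applies this to the counts_pictures values from the excerpt in Table 1; the use of the sample standard deviation is an assumption.

```python
from statistics import mean, stdev

def standardise(values):
    """Return z-scores: (x - mean) / sample standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [(x - mu) / sigma for x in values]

counts_pictures = [4, 5, 4, 3, 12]  # excerpt from Table 1
z_pictures = standardise(counts_pictures)
```

After the transformation each variable has mean 0 and standard deviation 1, so variables with very different ranges, such as views and pictures, can be plotted on the same axes.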
This segment aims to identify underlying patterns and relationships within the dataset. An initial step involves visually inspecting the variables, helping to assess their potential relevance and impact on the outcomes of interest. As a fundamental part of exploratory data analysis, these visual inspections allow one to discern which features could be instrumental in shaping predictive models.
To illuminate the relationships between the variables, a correlogram has been produced, which reveals some notable insights. For instance, there is a strong correlation between counts_kisses and counts_profileVisits (as expected), as well as between counts_g and both counts_kisses and counts_profileVisits. A slight negative correlation is observed between profile views and factors such as age, interests leaning towards ‘just friends’, and shareability of the profile. In contrast, having a verified status and showcasing multilingual abilities are positively correlated with profile likes, signifying their potential influence in enhancing a profile’s appeal. Interestingly, one of the paid features of the app, isHighlighted (which highlights your profile), shows no significant correlation with profile likes or views.
The new variables also show some positive relationships with profile likes and views. Specifically, bios containing emojis and social media tags have a correlation coefficient of 0.14 with profile likes. This points to their possible usefulness in predicting popularity.
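Each cell of a correlogram is a pairwise correlation coefficient. Assuming the Pearson coefficient is the one used, a single cell can be computed as in the sketch below, which uses the five rows of the Table 1 excerpt purely for illustration; the reported figures are based on the full dataset, not on these rows.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Excerpt from Table 1 (five rows only).
profile_visits = [8279, 663, 1369, 22187, 35262]
kisses = [239, 13, 88, 1015, 1413]
r_views_likes = pearson(profile_visits, kisses)
```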
As hinted in the introductory section, it quickly becomes apparent that specific variables have a more pronounced influence on the number of profile ‘Likes’, while others may largely dictate the number of ‘Profile Views’. This distinction is crucial, as certain profile elements only become observable once a profile is viewed. For instance, the information in a profile biography only comes into play during a profile view. Therefore, the dynamics of what draws views and subsequently encourages likes may differ significantly, although both are important aspects of profile engagement.
Interestingly, despite these differences, one notices a robust correlation between profile views and likes. This interplay implies that a successful profile is not just about attracting views but also about converting those views into likes. Figure 3 visually represents this relationship, further illuminating the interdependent nature of profile views and likes. Uncovering these patterns provides essential insights that can inform our subsequent modeling efforts.
Figure 4 below shows whether the distribution of likes received differs across the newly created dummy variables has_emoji, contains_popular_word, and night_owl. There seem to be some slight differences in likes received, supporting the idea that the use of emojis and certain words suggests higher levels of trust. Interestingly, being online seems to be negatively associated with profile views, as shown in the right panel of Figure 4. Being online at night may nonetheless increase profile views, but I treat this variable as a control variable rather than a causal one, as more people tend to be online at night than during the day.